One should look for what is and not what he thinks should be. (Albert Einstein)
Welcome!
Welcome: Icebreaker!
In the chat, discuss the following questions:
1. Why are you taking this course and what would you like to learn?
2. How do you use classification for your job? Think about specific tasks and goals.
3. What do you know about the random forest algorithm and how it is used?
Best practices for virtual classes
Find a quiet place, free of as many distractions as possible. Headphones are recommended.
Stay on mute unless you are speaking.
Remove or silence alerts from cell phones, e-mail pop-ups, etc.
Participate in activities and ask questions; this course is interactive!
Give your honest feedback so we can troubleshoot problems and improve the course.
What level of proficiency do I need?
To use programming as a tool in your professional toolkit, you don’t need to be a computer scientist or have an equivalent level of knowledge
The level of proficiency will depend on
the problems you are trying to solve on a daily basis
the subject matter area you are in
the level of sophistication of the solutions you would like to implement
Most of the time, subject matter experts who also use various programming tools and languages are known as data analysts or data scientists
What are the problems you are trying to solve? What is your area of expertise? What level of complexity would you like your programmatic solution to have?
A data scientist can
Pose the right question
Wrangle the data (gather, clean, and sample data to get a suitable data set)
Manage the data for easy access by the organization
Explore the data to generate a hypothesis
Make predictions using statistical methods such as regression and classification
Communicate the results using visualizations, presentations, and products
A data scientist needs to be able to
Use programming languages and tools to
Wrangle the data (gather, clean, and sample data to get a suitable dataset)
Manage the data for easy access by the organization
Explore the data to generate a hypothesis
Make predictions using statistical methods such as regression and classification
Based on the list above, your programming skills should include knowing a programming language (or two, or three, or …) well enough to perform these operations!
Data science control cycle (DSCC)
There is a protocol or standard for working with data that most data scientists follow
The cycle involves everything from asking the right questions and being knowledgeable about the data you’re studying, to optimizing your model’s performance
Which part of the cycle do you think takes up the most time?
Module completion checklist
Objective | Complete
Introduce random forest and discuss use cases |
Summarize the concepts associated with random forest and bagging |
Load dataset and implement random forest |
For each module, we’ll start out with our objectives for the session, so you know what to expect. Let’s get started!
Loading packages
Let’s load the packages we will be using
These packages are used for classification using random forests, boosting and other tools
import os
import pickle
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pathlib import Path
from textwrap import wrap
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Random forest and boosting packages
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
Random forest
What is random forest?
Ensemble method used for classification and regression tasks
Supervised learning algorithm which builds multiple decision trees and aggregates the result
Uses a technique called Bootstrap Aggregation, commonly known as Bagging
Limits overfitting by reducing variance while keeping bias low
Random forest: use cases
The random forest algorithm is used in a multitude of industries such as banking, medicine, e-commerce, etc.
Some examples are:
Fraud detection
Identifying a disease by analyzing a patient’s medical history
Predicting the behavior of the stock market
Understanding whether a customer will buy a product or not
Decision trees process
You probably already know how a decision tree works!
Decision trees predict the target value of an item by mapping observations about the item to conclusions about its target value
Below is a brief overview of decision trees as they are related to random forests
Decision trees
Decision trees are great when used for:
Classification and regression
Handling numerical and categorical data
Handling data with missing values
Handling data with nonlinear relationships between parameters
Decision trees are not very good at:
Generalization: they are known for overfitting
Robustness: small variations in data can result in a different tree, as illustrated below
Mitigating bias: if some classes dominate, trees may be unbalanced and biased
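To make the robustness point concrete, here is a small illustration that is not part of the course code: it fits two unpruned scikit-learn decision trees on almost identical subsets of a toy dataset (the toy data and the five-row shift are arbitrary choices for demonstration), and even this tiny change can alter the tree’s shape.

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy data purely for illustration.
X, y = make_classification(n_samples=200, random_state=1)

# Fit two unpruned trees on slightly different subsets of the same data.
tree_a = DecisionTreeClassifier(random_state=1).fit(X[:195], y[:195])
tree_b = DecisionTreeClassifier(random_state=1).fit(X[5:], y[5:])

# Compare the resulting structures; they often differ noticeably.
print(tree_a.get_depth(), tree_a.get_n_leaves())
print(tree_b.get_depth(), tree_b.get_n_leaves())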
Why should we use random forest?
Reduction in overfitting
Higher predictive accuracy
Efficient with large datasets
Why should we use decision trees instead?
Intuitive and easily interpretable results
Less computationally expensive algorithm
Module completion checklist
Objective | Complete
Introduce random forest and discuss use cases | ✔
Summarize the concepts associated with random forest and bagging |
Load dataset and implement random forest |
Why is random forest popular?
It uses many decision trees on different subsets of the dataset and averages out the results to improve the predictive accuracy and control overfitting
“Bagging” is an ensemble method that adopts the bootstrap sampling technique, which creates new datasets by using random sampling with replacement
The Out-of-Bag (OOB) error rate for the forest of trees is used as a metric to assess the algorithm’s performance
Because each observation is scored only by trees that never saw it during training, the OOB error acts as a built-in form of cross-validation (see the sketch below)
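In scikit-learn, the OOB score can be requested when the forest is built. The sketch below is not part of the course code; it uses toy data from make_classification, and the parameter values are arbitrary choices for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data purely for illustration.
X, y = make_classification(n_samples=500, random_state=1)

# oob_score=True scores each observation using only the trees
# that did not see it in their bootstrap sample.
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=1)
rf.fit(X, y)
print("OOB accuracy:", rf.oob_score_)  # OOB error rate = 1 - rf.oob_score_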
Bagging observations
Bagging is short for bootstrap aggregation
The bootstrap sampling technique creates new datasets by random sampling with replacement
Bagging within CART lets you choose how many trees, i.e., how many bootstrap-sampled training sets, to create (a small sampling sketch follows)
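Here is a minimal NumPy sketch of one bootstrap sample (the toy data of ten observations and the seed are arbitrary): the sample is the same size as the original and drawn with replacement, so some observations repeat while others are left “out of bag”.

import numpy as np

rng = np.random.default_rng(1)  # seed chosen arbitrarily for illustration
data = np.arange(10)            # a toy "training set" of 10 observations

# Same size as the original, drawn with replacement.
sample = rng.choice(data, size=data.shape[0], replace=True)
out_of_bag = np.setdiff1d(data, sample)  # observations never drawn
print(sample, out_of_bag)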
Random forest: bagging
Bagging of observations in itself is pretty cool!
But that’s not all it does…
Bagging is not enough
Using just bagging, the trees in a random forest can share a lot of structure, because the same strong predictors tend to dominate the top splits of every tree; such highly correlated trees add little beyond a single decision tree (a known drawback of bagged tree ensembles!)
Sample predictors as well!
The true power of random forests over CART comes from limiting the predictors available to each tree
For each tree, a random sample of features is drawn in addition to the bootstrap sample of observations, instead of using the entire set of features every time
The resulting trees are decorrelated, since no single variable can dominate every tree, and this variety is what improves the ensemble!
Building the forest
The two main parameters we need to set to build a random forest are:
Number of trees
Number of features per tree
We can stick with these rules:
Number of trees: the more the better, but a good rule of thumb is \(n \approx 100\), where \(n\) is the number of trees
Number of features per tree: the rule of thumb here is \(m = \sqrt{p}\), where \(p\) is the number of predictors (both map onto scikit-learn parameters, as sketched below)
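In scikit-learn these two rules of thumb correspond to the n_estimators and max_features parameters of RandomForestClassifier; a minimal sketch, with values simply following the rules above:

from sklearn.ensemble import RandomForestClassifier

# n_estimators is the number of trees (n ≈ 100);
# max_features='sqrt' uses m = sqrt(p) features per split.
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt')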
Random forest methodology
Random forest classification
Handling regression with random forest
For this module, we will focus on classification
Knowledge Check 1
Module completion checklist
Objective | Complete
Introduce random forest and discuss use cases | ✔
Summarize the concepts associated with random forest and bagging | ✔
Load dataset and implement random forest |
Datasets for today
We will be using two datasets in class today: one for in-class practice and the other for self-guided exercise work
A dataset in class to learn the concepts
Costa Rica household poverty data by the Inter-American Development Bank
A dataset for the self-guided exercises
Bank marketing dataset
Related to direct marketing campaigns of a Portuguese banking institution
The target variable is whether the client subscribed to a term deposit (‘yes’) or not (‘no’)
Costa Rican poverty: case study
We will be diving into a case study from the Inter-American Development Bank (IDB)
The IDB conducted a competition amongst data scientists on Kaggle.com
Many countries face this same problem of inaccurately assessing social needs
The following case study on Costa Rican poverty levels is a good example of how we can use data science within social sciences
Costa Rican poverty: backstory
Costa Rican poverty level prediction
As stated by the IDB:
Social programs have a hard time making sure the right people are given enough aid
It’s especially tricky when a program focuses on the poorest segment of the population
The world’s poorest typically can’t provide the necessary income and expense records to prove that they qualify
Costa Rican poverty: backstory
The Proxy Means Test (PMT) is an algorithm used to verify income qualification
PMT uses a model that considers a family’s observable household attributes (e.g. building materials) or assets to classify them and predict their level of need
While this is an improvement, accuracy remains a problem as the region’s population grows and poverty declines
Costa Rican poverty: proposed solution
To improve on PMT, the IDB built a competition for Kaggle participants to use methods beyond traditional econometrics
The given dataset contains Costa Rican household characteristics with a target of four categories:
extreme poverty
moderate poverty
vulnerable households
non-vulnerable households
Costa Rican poverty: proposed solution
The goal is to develop an algorithm to predict these poverty levels that can eventually be applied not only to the Costa Rican population but also to other countries facing the same problem
We will work with the Costa Rican dataset and see what we can develop. We will:
Clean the dataset
Wrangle the data and create classification models to predict poverty level
Predicting poverty
Ultimate goal
Understand the patterns and groups within the dataset
Predict poverty levels of Costa Rican households and build a model that is reproducible for other countries
Data cleaning steps
Today, we will first clean the Costa Rican dataset
The steps to get to this cleaned dataset are as follows (a pandas sketch follows the list):
Remove household ID and individual ID
Remove variables with over 50% NAs
Transform target variable to binary
Remove highly correlated variables
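As a sketch of these steps, the pandas/NumPy snippet below is one possible implementation, not the course’s exact code: the file name, the ID column names ('Id' and 'idhogar'), the binary mapping (level 4 = non-vulnerable = 0, levels 1 to 3 = 1), and the 0.9 correlation threshold are all assumptions to adjust to your data.

import numpy as np
import pandas as pd

# Read the raw data (hypothetical file name).
costa = pd.read_csv("costa_rica_poverty.csv")

# 1. Remove household ID and individual ID (assumed column names).
costa = costa.drop(['Id', 'idhogar'], axis=1)

# 2. Remove variables with over 50% NAs.
costa = costa.loc[:, costa.isna().mean() <= 0.5]

# 3. Transform the four-level target to binary (assumed mapping).
costa['Target'] = np.where(costa['Target'] == 4, 0, 1)

# 4. Remove highly correlated variables (assumed |r| > 0.9 cutoff).
num = costa.drop('Target', axis=1).select_dtypes(include=np.number)
corr = num.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
costa_tree = costa.drop(columns=to_drop)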
Directory settings
To maximize the efficiency of your workflow, you should store your directory paths in variables
We will use the pathlib library
Let the main_dir be the variable corresponding to your course folder
Let data_dir be the variable corresponding to your data folder
# Set 'main_dir' to the location of the project folder.
home_dir = Path(".").resolve()
main_dir = home_dir.parent.parent
print(main_dir)
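The snippet above defines main_dir but not data_dir; assuming the data folder is a subfolder of the project folder named data (an assumption, adjust to your course setup), it can be defined the same way:

data_dir = main_dir / "data"  # assumed folder name; adjust to your setup
print(data_dir)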
We will be using the well-known Python library scikit-learn today
Scikit-learn is used for many machine learning algorithms
Here is a quick overview of some of the popular method families scikit-learn covers
ML methods | Purpose
Clustering | Unsupervised learning methods such as k-means
Classification and regression | Supervised learning methods like generalized linear models, logistic regression, support vector machines, and decision trees
Cross validation | Estimating the performance of supervised models
Dimensionality reduction | Feature selection and feature extraction methods
Ensemble methods | Combining predictions of multiple supervised models
Parameter tuning | Adjusting model parameters to get the most out of models
Manifold learning | Summarizing and depicting complex multi-dimensional data
Scikit-learn: random forest
We will be using the RandomForestClassifier class from scikit-learn
By default, each tree’s bootstrap sample is the same size as the original input sample, with observations drawn with replacement, as long as bootstrap = True (the default)
# Select the predictors and target.
X = costa_tree.drop(['Target'], axis=1)
y = np.array(costa_tree['Target'])

# Set the seed to 1.
np.random.seed(1)

# Split into the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
RandomForestClassifier
We introduced the RandomForestClassifier class earlier today
We are now going to use it to build a random forest on our clean data
First, let’s look at the methods available once the model is built
We are going to:
Build the random forest model
Fit the model to the training data
Predict on the test data using our trained model
Building our model
Let’s build our random forest model and use all default parameters for now, as our baseline model
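A minimal sketch of the three steps above, using the training and test sets created earlier; random_state=1 is an assumption added here for reproducibility, and the accuracy check is just one simple way to inspect the baseline:

# Build the random forest model with default parameters (baseline).
forest = RandomForestClassifier(random_state=1)

# Fit the model to the training data.
forest.fit(X_train, y_train)

# Predict on the test data using the trained model.
y_pred = forest.predict(X_test)

# Evaluate the baseline with simple accuracy.
print("Test accuracy:", metrics.accuracy_score(y_test, y_pred))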